The Voynich Ninja

Full Version: Interesting result for f3r with an ngram Genetic Algorithm
Quote:Julian:
I think the statistics will look very different for an n-gram substitution scheme with nulls, as opposed to a simple substitution scheme which is clearly discounted.


It certainly would. The difficulty is in figuring out which ones might be nulls (I've been struggling with this for years).

As soon as you postulate nulls, you have to assume the spaces (or some of the spaces) might also be nulls.

And... as soon as you try interpreting the spaces in different ways, Capelli goes out the window, because a high proportion of the patterns in the VMS text that behave in ways similar to Latin abbreviations (or appear to) are at the beginnings and ends of tokens. Re-interpreting spaces changes the whole expansion process.
(30-11-2017, 05:38 PM)nablator Wrote:
(30-11-2017, 04:44 PM)farmerjohn Wrote:
(30-11-2017, 04:13 PM)nablator Wrote: Curiously it lacks some words that are found in all dictionaries. No idea why.

Which ones for example?

eclipsis is missing
meridionalis is missing
inrationabilis is present but rationabilis is missing... that's unreasonable. :)

Also some verb tenses are maybe missing (and supine) and some comparatives (for example: superiores).

Hm... It seems that the wordlist you mentioned was generated from an older version of Whitaker's Words, because even [link] accepts all four of your examples (and the current online version is not the newest; the newest one can be found somewhere on the same site, and consists of a command-line app and the source files for the dictionary).
(30-11-2017, 09:06 PM)julian Wrote:
(30-11-2017, 08:07 PM)MarcoP Wrote:
(30-11-2017, 06:12 PM)nablator Wrote: I am looking for a way to reward small improvements for the GA to chew on without selecting too much fake Latin. The idea is not to make a lorem ipsum generator. So I was thinking about something like this:
- count actual Latin words as correct with score = word length
- count common Latin bigrams and trigrams (a selection of 20-50 for example) as correct with score = a small percentage of their total length

The statistics of Voynichese make clear that it cannot be a simple substitution code for Latin. Independently from the specific language, I think it could be a good idea to consider frequency as well:
autem and super should have higher scores than seror and loris.
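The scoring idea quoted above (full credit for real Latin words, a small fraction for common n-grams, more weight for frequent words) could be sketched roughly as follows. This is only an illustration: the word frequencies, the n-gram set, the 0.1 weight and the log scaling are all made-up assumptions, not values anyone in the thread proposed.

```python
import math

# Hypothetical toy data: a real run would load a full Latin word list
# with corpus frequencies and a measured table of common n-grams.
WORD_FREQ = {"autem": 900, "super": 700, "seror": 2, "loris": 3, "et": 5000}
COMMON_NGRAMS = {"us", "um", "is", "qu", "que", "tur"}
NGRAM_WEIGHT = 0.1  # assumed "small percentage" of the n-gram length


def score_word(word: str) -> float:
    """Score one candidate plaintext word."""
    if word in WORD_FREQ:
        # Real word: score = length, scaled up for frequent words
        # (log scaling is an assumption, one of many possible choices).
        return len(word) * (1.0 + math.log10(WORD_FREQ[word]))
    # Not a word: partial credit for each common bigram/trigram it contains.
    partial = 0.0
    for n in (2, 3):
        for i in range(len(word) - n + 1):
            if word[i:i + n] in COMMON_NGRAMS:
                partial += NGRAM_WEIGHT * n
    return partial


def fitness(candidate_text: str) -> float:
    """Fitness of a whole candidate decipherment (space-separated words)."""
    return sum(score_word(w) for w in candidate_text.split())
```

With these toy numbers, `fitness("autem super")` comes out higher than `fitness("seror loris")`, and a non-word like `xyzus` only earns the small bigram credit for `us`.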

I think the statistics will look very different for an n-gram substitution scheme with nulls, as opposed to a simple substitution scheme which is clearly discounted.

Hi Julian,
if I understand correctly, you are assuming that spaces are relevant: each Voynichese word corresponds to a single Latin word. I think there is plenty of evidence suggesting that spaces are relevant, so this assumption seems reasonable to me.

Have you seen this page by Rene?
[link]

I reproduce one of his graphs (ignore the digraph curves at the bottom).
red 24 pages of herbal-A, in Curva
blue 24 pages of herbal-B, in Curva
green Genesis (Vulgate)
grey De Bello Gallico (Latin)

The four lines illustrate different word "types" as a function of the length of a text sample.
The grey line (Caesar) tells us that 2000 words of this text typically include about 1100 different words. The figures for Voynichese are close (about 1000 different words). The other Latin sample (the Vulgate) provides about 750 different words.

Since the figures are comparable, as Rene writes, a Latin to Voynich (or Voynich to Latin) conversion should "maintain the length of the vocabulary". The introduction of nulls can only reduce the vocabulary (e.g. if EVA:t is a null, choty and choy will become a single word). If one considers the Vulgate as a lower bound, there is some margin for the introduction of nulls, but not much.
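To make the vocabulary argument concrete, here is a minimal sketch of counting word types in a sample before and after deleting a hypothesised null. The token list is invented for illustration, and each EVA letter is treated as one character, which is itself an assumption about the transcription.

```python
def type_count(tokens, sample_size):
    """Number of distinct word types in the first `sample_size` tokens."""
    return len(set(tokens[:sample_size]))


def strip_null(tokens, null_char):
    """Delete a hypothesised null from every token; drop tokens left empty."""
    stripped = (t.replace(null_char, "") for t in tokens)
    return [t for t in stripped if t]


# Made-up sample, not real transcription data.
tokens = "choty choy daiin chol choty aiin".split()

types_before = type_count(tokens, len(tokens))
types_after = type_count(strip_null(tokens, "t"), len(tokens))
```

Here choty and choy collapse into one type, so the count drops from 5 to 4. Run over growing sample sizes, the same function would give a curve comparable to the ones in Rene's graph.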

Also, Latin words tend to be longer on average than EVA words. Word length is harder to match if one introduces nulls.
The two attached histograms are based on the length of word occurrences (not word "types" as in Rene's graph discussed above).

Latin has a spike for words 2 characters long. These correspond to a number of hugely frequent particles like prepositions (in, ex, ab, de...), conjunctions  (et, ac...), pronouns (id, tu, me...). 
Latin also has a long tail of long words, which of course become less frequent as length increases. In the Vulgate, 9% of the words are 10 or more characters long.

Voynichese has a simpler distribution, with a single peak corresponding to length 5 (corresponding to the central, but secondary, peak for Latin). Long words are rarer: less than 1% are 10 or more characters long.

The mapping from Voynichese to Latin should stretch the histogram in two opposite directions, possibly by making short words shorter and long words longer.
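For anyone wanting to reproduce this kind of comparison, a rough sketch of the two measurements (length histogram over word occurrences, and the share of words with 10 or more characters) might look like this. The sample words are placeholders; real input would be a tokenised Vulgate or a transcription file.

```python
from collections import Counter


def length_histogram(tokens):
    """Count word occurrences (tokens, not types) by length."""
    return Counter(len(t) for t in tokens)


def long_word_share(tokens, min_len=10):
    """Fraction of word occurrences with at least `min_len` characters."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if len(t) >= min_len) / len(tokens)


# Placeholder sample, only to show the calls.
sample = ["et", "in", "de", "autem", "benedictionibus"]
hist = length_histogram(sample)
share = long_word_share(sample)
```

In this toy sample three of the five words have length 2 and one of five is "long", so `hist[2]` is 3 and `share` is 0.2.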
(01-12-2017, 05:57 PM)MarcoP Wrote: The two attached histograms are based on the length of word occurrences (not word "types" as in Rene's graph discussed above).

Latin has a spike for words 2 characters long. These correspond to a number of hugely frequent particles like prepositions (in, ex, ab, de...), conjunctions  (et, ac...), pronouns (id, tu, me...). 
Latin also has a long tail of long words, which of course become less frequent as length increases. In the Vulgate, 9% of the words are 10 or more characters long.

Voynichese has a simpler distribution, with a single peak corresponding to length 5 (corresponding to the central, but secondary, peak for Latin). Long words are rarer: less than 1% are 10 or more characters long.

The mapping from Voynichese to Latin should stretch the histogram in two opposite directions, possibly by making short words shorter and long words longer.

Thanks, Marco.

Based on comparisons of Voynich word frequencies in Herbal A and Herbal B, the following pops out:
  • EVA-d is a null. Or optional. 
This substitution places EVA-aiin at the top of the Herbal word frequency list for language A and B.


It looks like the Language A scribe always wrote EVA-daiin whereas the Language B scribe sometimes wrote EVA-daiin and sometimes EVA-aiin. I haven't looked to see if s/he switched to EVA-aiin in the later Language B folios - that would be a nice finding if so.

If you replace EVA-d with a null and recompute the word statistics, you find that EVA-aiin occurs about one and a half times as often as the next most common word, in both Language A and B, which I think lends credence to the postulate that EVA-d is a null :D
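The recomputation described above can be sketched in a few lines. This is an assumed reconstruction, not the script actually used: it deletes the candidate null from every word and reports the two most frequent words plus their frequency ratio. The token list is invented so that the ratio comes out at 1.5.

```python
from collections import Counter


def top_two_ratio(tokens, null_char=None):
    """Return (most common word, runner-up, count ratio) after null removal."""
    if null_char:
        tokens = [t.replace(null_char, "") for t in tokens]
    counts = Counter(t for t in tokens if t)
    ranked = counts.most_common(2)
    if len(ranked) < 2:
        raise ValueError("need at least two distinct words")
    (first, c1), (second, c2) = ranked
    return first, second, c1 / c2


# Invented counts, chosen only to illustrate the calculation.
sample = ["daiin"] * 4 + ["aiin"] * 2 + ["chol"] * 4 + ["chor"] * 2
result = top_two_ratio(sample, null_char="d")
```

Note that deleting d everywhere also rewrites every other word containing it, so the whole frequency list shifts, not just daiin/aiin.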

Going further, replacing EVA-d with a null in the whole manuscript, and looking at the resulting top frequency words leads to another pop out:
  • EVA-chol (Language A) is the same word as EVA-chey (Language B)
  • EVA-chor (Language A) is the same word as EVA-shey (Language B)
Food for thought :P
(01-12-2017, 10:27 PM)julian Wrote: EVA-d is a null. Or optional.

Which makes EVA-d a perfect candidate for diminutive suffix :D
(01-12-2017, 10:27 PM)julian Wrote:
Thanks, Marco.

Based on comparisons of Voynich word frequencies in Herbal A and Herbal B, the following pops out:
  • EVA-d is a null. Or optional. 
This substitution places EVA-aiin at the top of the Herbal word frequency list for language A and B.


It looks like the Language A scribe always wrote EVA-daiin whereas the Language B scribe sometimes wrote EVA-daiin and sometimes EVA-aiin. 


Hi Julian,
as I wrote, if you treat 'd' (or any other character) as a null, you get shorter words. I don't think this makes the Voynichese histogram closer to the bimodal distribution of Latin word lengths. It certainly makes the matching of the frequent long Latin words even harder.

About daiin / aiin, a few months ago I analysed some aspects of Voynich reduplication (the repeated occurrence of the same word, like daiin.daiin). These graphs (from [link]) compare actual reduplicated occurrences of words with what would be statistically expected on the basis of word frequency (separately for Currier A and B).

[Image: attachment.php?aid=1645]

It turns out that a major difference between aiin and daiin is that daiin reduplicates less than expected but consistently (in both A and B). On the other hand, aiin never does: it is the most frequent word exhibiting this peculiar behaviour. The examples of reduplication I have seen in other languages are semantic (they often imply the intensification of a concept, e.g. "very very good"). The fact that the two words have different reduplication patterns suggests that they might have different meanings and possibly belong to different word classes (random example: one is an adjective, the other a conjunction). 
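A minimal sketch of an observed-versus-expected reduplication count is given below. The "expected" figure assumes words are placed independently, so a word with relative frequency p should follow itself about (N-1)·p² times in N tokens. That is one common baseline and an assumption here; the graphs above may have been computed differently, and line or paragraph breaks are ignored.

```python
from collections import Counter


def reduplication_table(tokens):
    """Map each word to (observed, expected) counts of immediate repetition."""
    n = len(tokens)
    counts = Counter(tokens)
    # Observed: adjacent pairs where both tokens are the same word.
    observed = Counter(a for a, b in zip(tokens, tokens[1:]) if a == b)
    # Expected under independence: (n - 1) adjacent pairs, each p * p.
    return {
        word: (observed[word], (n - 1) * (c / n) ** 2)
        for word, c in counts.items()
    }


# Made-up token sequence, for illustration only.
tokens = "daiin daiin chol aiin chol aiin daiin chor".split()
table = reduplication_table(tokens)
```

In this toy sequence daiin repeats once against an expectation just under 1, while aiin never repeats against an expectation of about 0.44.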

[I am very grateful to Emma May Smith for discussing the aiin / daiin stats with me: my opinions on the subject are largely dependent on those exchanges]
(02-12-2017, 11:04 AM)MarcoP Wrote:
Hi Julian,
as I wrote, if you treat 'd' (or any other character) as a null, you get shorter words. I don't think this makes the Voynichese histogram closer to the bimodal distribution of Latin word lengths. It certainly makes the matching of the frequent long Latin words even harder.

About daiin / aiin, a few months ago I analysed some aspects of Voynich reduplication (the repeated occurrence of the same word, like daiin.daiin). These graphs (from [link]) compare actual reduplicated occurrences of words with what would be statistically expected on the basis of word frequency (separately for Currier A and B).

[Image: attachment.php?aid=1645]

It turns out that a major difference between aiin and daiin is that daiin reduplicates less than expected but consistently (in both A and B). On the other hand, aiin never does: it is the most frequent word exhibiting this peculiar behaviour. The examples of reduplication I have seen in other languages are semantic (they often imply the intensification of a concept, e.g. "very very good"). The fact that the two words have different reduplication patterns suggests that they might have different meanings and possibly belong to different word classes (random example: one is an adjective, the other a conjunction). 

[I am very grateful to Emma May Smith for discussing the aiin / daiin stats with me: my opinions on the subject are largely dependent on those exchanges]

That's very interesting, Marco. Of course, if EVA-d is a null, then EVA-aiin would reduplicate more often. 

The Language A and B features need much more investigation like this: they've always seemed to be a massive clue as to what is going on.
(02-12-2017, 08:01 PM)julian Wrote: That's very interesting, Marco. Of course, if EVA-d is a null, then EVA-aiin would reduplicate more often.

Hi Julian,
I think that reduplication in Voynichese looks like reduplication in other languages. One of the peculiarities of Voynichese, with respect to Latin and the other European languages I know something of, is that reduplication is much more frequent. I believe this is one of the elements that suggest that the language is not one of the obvious ones (but reduplication could in principle be a feature of the text and not of the language).
I think that the simplest explanation for reduplication is that it really is what it seems: multiple consecutive occurrences of the same word. This typically has a semantic value. I am not aware of useful alternatives to this simple explanation: the alternatives I am aware of are of the kind "it's gibberish" and I don't find them interesting.

In the case of daiin and aiin, the presence and absence of d- objectively correlate with the presence or absence of reduplication. If reduplication is what it appears to be (the meaningful repetition of words), d- seems likely to be meaningful as well.
Nulls are a bit problematic for the Voynich MS text.

In the simplest case, nulls would be inserted into the text at random.

However, this would have a big negative impact on the adherence of the text to Zipf's law.
Furthermore, the null character(s) would appear in combination with almost all other characters.

We see neither of these things in the Voynich MS text. Combinations are quite restricted for all characters, and Zipf's law is observed quite reasonably.

The alternative would be that the nulls are added according to some rule, in such a way that the two effects above do not occur.
In that case, it is more like a verbose cipher.
If this is really what was done, then an n-to-m gram substitution might be the solution, and why not attack it with a genetic algorithm indeed.
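The random-null case described above is easy to simulate. The sketch below is a toy under stated assumptions (an invented six-word text, a made-up null character "x", a 20% insertion rate). It inserts the null at random and then checks two things: how many characters the null ends up next to, and how much the number of word types inflates, which is what would distort the rank-frequency (Zipf) curve.

```python
import random


def insert_random_nulls(tokens, null_char, rate, rng):
    """Insert `null_char` before each character with probability `rate`."""
    out = []
    for token in tokens:
        chars = []
        for ch in token:
            if rng.random() < rate:
                chars.append(null_char)
            chars.append(ch)
        out.append("".join(chars))
    return out


def right_neighbours(tokens, char):
    """Set of characters that directly follow `char` inside a token."""
    return {
        t[i + 1]
        for t in tokens
        for i in range(len(t) - 1)
        if t[i] == char
    }


rng = random.Random(0)  # fixed seed so the run is repeatable
plain = "daiin chol chor shey chey qokeedy".split() * 200
noisy = insert_random_nulls(plain, "x", 0.2, rng)

alphabet = set("".join(plain))
coverage = len(right_neighbours(noisy, "x") & alphabet) / len(alphabet)
types_plain = len(set(plain))
types_noisy = len(set(noisy))
```

On a text of this size the null ends up in front of every character of the alphabet (`coverage` is 1.0) and the six original word types multiply into many variants. A rule-based null, as in the verbose-cipher alternative, would avoid both effects.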